Working with data

What Are Data?

Data are the core of everything that we do in statistical analysis. Data come in many forms, and I don’t just mean .csv, .xls, .sav, etc. Data can be wide, long, documented, fragmented, messy, and about anything else that you can imagine.

Although data could arguably be more means than end in psychology, the importance of understanding the structure and format of your data cannot be overstated. Failure to understand your data can lead to improper techniques and, at worst, flagrantly wrong inferences. This is especially important for longitudinal data. We will discuss many aspects of data handling. One thing to note is that this is just ONE WAY to do it. There are many equivalent approaches.

Why are we thinking about data? Because 80%, maybe more, of your time spent on “analysis” is actually spent getting data in order and setting up your model of interest.

Wide vs long

Aka multivariate vs stacked; person vs person-period; untidy vs tidy.

Long is what MLM, ggplot2, and the tidyverse packages expect, whereas SEM and a lot of descriptive statistics are calculated using wide dataframes.

In this figure, X, Y, and Z could represent different waves of collection. For each wave we have some value for each of the two people in the dataset. In the long format, each person has each wave as a separate row. In the wide format, each person has all of their data on a single row.

We will be working with long data for the first half of the class and wide data for the second. However, even during the first half we will need to switch back and forth to make sure we can calculate certain values.
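To make the distinction concrete, here is a minimal sketch (with hypothetical values) that round-trips the same two-person dataset between the two formats, using the pivot functions covered in the next section:

```r
library(tidyverse)

# Hypothetical values: two people measured at waves X, Y, and Z
wide <- tribble(
  ~ID, ~X, ~Y, ~Z,
    1, 10, 12, 14,
    2,  9, 11, 13
)

# Wide -> long: one row per person-wave
long <- wide %>%
  pivot_longer(cols = X:Z, names_to = "wave", values_to = "value")

# Long -> wide: back to one row per person
long %>%
  pivot_wider(names_from = wave, values_from = value)
```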

tidyr

The best package for going back and forth between long and wide is the tidyr package, which is part of the tidyverse. Here we will walk through some examples of the two primary functions, pivot_wider() and pivot_longer().

For longitudinal/repeated measures data, each row is an observation, so each person will have multiple rows. You can grab some example data from the class’s GitHub.

library(tidyverse) # provides %>%, select(), and the pivot functions used below
data <- read.csv("https://raw.githubusercontent.com/josh-jackson/longitudinal-2021/master/example.csv")
example <- data %>% 
  select(ID, wave, group, DAN)
head(example)
  ID wave group    DAN
1  6    1    PD 0.1619
2  6    2    PD 0.1677
3  6    3    PD 0.2153
4 29    1    PD 0.1749
5 29    2    PD 0.1356
6 34    1  CTRL 0.1659

pivot_wider

The pivot_wider() function takes two key arguments: names_from, the variable whose values will become the new column names, and values_from, the variable whose values will fill the cells.

wide.ex <- example %>% 
  pivot_wider(names_from = wave, values_from = DAN) 
wide.ex
# A tibble: 91 x 6
      ID group    `1`    `2`    `3`   `4`
   <int> <chr>  <dbl>  <dbl>  <dbl> <dbl>
 1     6 PD    0.162  0.168   0.215    NA
 2    29 PD    0.175  0.136  NA        NA
 3    34 CTRL  0.166  0.140  NA        NA
 4    36 CTRL  0.152  0.205  NA        NA
 5    37 PD    0.219  0.158   0.259    NA
 6    48 PD    0.130  0.270   0.248    NA
 7    53 CTRL  0.211  0.152  NA        NA
 8    54 PD    0.220  0.152   0.192    NA
 9    58 PD    0.380  0.215   0.204    NA
10    61 PD    0.0818 0.0628 NA        NA
# … with 81 more rows

pivot_longer

Going back to long:

The pivot_longer() function takes three arguments: cols is the set of columns to be collapsed; the columns can be referenced by column number or column name. names_to is the name of the new column that will hold the old column names; it is up to you to decide what that name is. values_to is the name of the new column that will hold the cell values associated with each variable combination.

long.ex <- wide.ex %>% 
  pivot_longer(cols = '1':'4', 
               names_to = "wave", 
               values_to = "DAN")
long.ex
# A tibble: 364 x 4
      ID group wave     DAN
   <int> <chr> <chr>  <dbl>
 1     6 PD    1      0.162
 2     6 PD    2      0.168
 3     6 PD    3      0.215
 4     6 PD    4     NA    
 5    29 PD    1      0.175
 6    29 PD    2      0.136
 7    29 PD    3     NA    
 8    29 PD    4     NA    
 9    34 CTRL  1      0.166
10    34 CTRL  2      0.140
# … with 354 more rows

Separate and Unite

Many times datasets are, for lack of a better term, messy. We will talk more later about the upfront work needed to make sure you don’t have messy data. However, if you do have messy data, there are a number of helpful functions to tidy it up.

One common way to represent longitudinal data is to name the variable with a wave signifier.

wide<- tribble(
  ~ID, ~ext_1, ~ext_2, ~ext_3,
  1, 4, 4,4,
  2, 6, 5,4,
  3, 4,5,6
)
wide
# A tibble: 3 x 4
     ID ext_1 ext_2 ext_3
  <dbl> <dbl> <dbl> <dbl>
1     1     4     4     4
2     2     6     5     4
3     3     4     5     6

If we tried to pivot_longer this, we’d end up with:

wide %>% 
  pivot_longer(cols = ext_1:ext_3, names_to = "time", values_to = "EXT")
# A tibble: 9 x 3
     ID time    EXT
  <dbl> <chr> <dbl>
1     1 ext_1     4
2     1 ext_2     4
3     1 ext_3     4
4     2 ext_1     6
5     2 ext_2     5
6     2 ext_3     4
7     3 ext_1     4
8     3 ext_2     5
9     3 ext_3     6

The time column is now specific to ext, which is a problem if I have more than one variable that I am pivoting. But, we will end up using wave as our time variable in our model, and time will have to be numeric. So how can we go ahead and separate out the ext part?

One way is to use the separate() function:

long<- wide %>% 
  pivot_longer(cols = ext_1:ext_3, 
               names_to = "time", 
               values_to = "EXT") %>% 
  separate(time, into = c("variable", "time"))
long
# A tibble: 9 x 4
     ID variable time    EXT
  <dbl> <chr>    <chr> <dbl>
1     1 ext      1         4
2     1 ext      2         4
3     1 ext      3         4
4     2 ext      1         6
5     2 ext      2         5
6     2 ext      3         4
7     3 ext      1         4
8     3 ext      2         5
9     3 ext      3         6

In terms of setting up your data, it is often helpful to include markers that separate parts of the variable name, e.g. "_" or “.” A variable named ext_1 is easier to separate than ext1.

Note also that the time column is a character rather than numeric. We need to change this so that we can use time as a continuous variable in our models. There are a few ways to do it, but this is perhaps the most straightforward:

long$time <- as.numeric(long$time)
long
# A tibble: 9 x 4
     ID variable  time   EXT
  <dbl> <chr>    <dbl> <dbl>
1     1 ext          1     4
2     1 ext          2     4
3     1 ext          3     4
4     2 ext          1     6
5     2 ext          2     5
6     2 ext          3     4
7     3 ext          1     4
8     3 ext          2     5
9     3 ext          3     6

However, something a little more elegant is to do both the separating AND the conversion to numeric within the original pivot_longer() call.

names_prefix strips a matching prefix from each column name before it goes into the new names_to column. Previously we had ext_1, ext_2, etc., which we had to separate with a different function, but this does it within pivot_longer.

wide %>% 
  pivot_longer(cols = ext_1:ext_3, 
               names_to = "time", 
               values_to = "EXT", 
               names_prefix = "ext_") 
# A tibble: 9 x 3
     ID time    EXT
  <dbl> <chr> <dbl>
1     1 1         4
2     1 2         4
3     1 3         4
4     2 1         6
5     2 2         5
6     2 3         4
7     3 1         4
8     3 2         5
9     3 3         6

names_transform applies transformations to the new name columns. Here, instead of a separate as.numeric() step, we can make our time variable numeric directly.

wide %>% 
  pivot_longer(cols = ext_1:ext_3, 
               names_to = "time", 
               values_to = "EXT", 
               names_prefix = "ext_", 
               names_transform = list(time = as.numeric)) 
# A tibble: 9 x 3
     ID  time   EXT
  <dbl> <dbl> <dbl>
1     1     1     4
2     1     2     4
3     1     3     4
4     2     1     6
5     2     2     5
6     2     3     4
7     3     1     4
8     3     2     5
9     3     3     6

Another common problem we often face is the need to unite two variables into one. Enter the creatively titled unite() function. This often comes up when our time metric is entered in separate columns.

df <- tibble(
      ID = c(1,   2,  3),
      year  = c(2020,  2020, 2020),
      month  = c(1,  1, 1),
      day  = c(1,  1, 1),
      hour   = c(4,  2, 5),
      min   = c(55, 17, 23))
df
# A tibble: 3 x 6
     ID  year month   day  hour   min
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  2020     1     1     4    55
2     2  2020     1     1     2    17
3     3  2020     1     1     5    23

To combine them into one time metric:

df %>% 
  unite(col = time, hour, min, sep = ":", remove = TRUE)
# A tibble: 3 x 5
     ID  year month   day time 
  <dbl> <dbl> <dbl> <dbl> <chr>
1     1  2020     1     1 4:55 
2     2  2020     1     1 2:17 
3     3  2020     1     1 5:23 

Date-time metrics

A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). These are called POSIXct in R.

library(lubridate) # today(), now(), and the parsing functions below
today()
[1] "2021-01-29"
now()
[1] "2021-01-29 12:34:45 CST"

Bringing these into R from some outside program (Excel, SPSS) can lead to confusion, as they can be formatted differently:

ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"

You can create these relatively straightforwardly by hand…

ymd_hms("2017-01-31 20:11:59")
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"

Or you can build them from existing columns. This is where the lubridate package comes in handy:

df %>% 
  mutate(t_1 = make_datetime(year, month, day, hour, min))
# A tibble: 3 x 7
     ID  year month   day  hour   min t_1                
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>             
1     1  2020     1     1     4    55 2020-01-01 04:55:00
2     2  2020     1     1     2    17 2020-01-01 02:17:00
3     3  2020     1     1     5    23 2020-01-01 05:23:00

Note the t_1 variable is a POSIXct variable type. Once in this format it is much easier to manipulate and work with dates and times.
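For example, once a variable is a POSIXct, arithmetic and component extraction work directly. A small sketch with hypothetical date-times:

```r
library(lubridate)

t1 <- ymd_hms("2020-01-01 04:55:00")  # hypothetical date-times
t2 <- ymd_hms("2020-06-01 10:00:00")

t2 - t1                                 # a difftime (elapsed days)
time_length(interval(t1, t2), "month")  # elapsed time in a chosen unit
t1 + months(6)                          # shift forward by six months
year(t2)                                # pull out a component
```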

Projects and Rmarkdown

As with any project, but especially for longitudinal data, one of the most important aspects of data analysis is A. not losing track of what you did and B. being organized. This is much harder than it sounds. I find that a combination of 1. RStudio projects, 2. git, and 3. codebooks is helpful in accomplishing these two goals. We will talk about #1 and #2, but I also encourage you to read more about git. These are not the only ways to do these sorts of analyses, but I feel that exposure to them is helpful, as in the social sciences these sorts of decisions are often not discussed.

What these help do is create a chain of processing where you start with the RAW data and end up with the cleaned data. Importantly, you can always start over from the raw data. This matters for people wanting to reproduce your findings, and for your future self trying to figure out where a certain variable came from.

We start creating the chain of processing by documenting all of our code. To do so we will be using rmarkdown documents, as the language is easier than LaTeX and more helpful than plain text.

When I create an rmarkdown document for my own research projects, I always start by setting up 3 components:

  1. Packages
  2. Codebook(s)
  3. Data

Below, we will step through each of these separately, setting ourselves up to (hopefully) flawlessly communicate with R and our data. Note that you do not need to use rmarkdown, but I think it is much more useful than standard .R syntax files.

Packages

Loading packages seems like the most basic step, but it is actually very important: depending on what gets loaded, and in what order, functions from one package may mask functions from another. (Note: I will often reload packages or not follow this advice within lectures for didactic reasons, choosing to put library() calls directly above the code that uses them.)

# load packages
library(psych)
library(plyr)
library(tidyverse)
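As a quick illustration of masking (a sketch, not part of the class pipeline): plyr and dplyr both define summarise(), so loading plyr after the tidyverse would mask the dplyr version. The package-prefixed form is always unambiguous:

```r
library(dplyr)

# Explicitly namespacing a function sidesteps any masking caused by load order
mtcars %>% dplyr::summarise(mean_mpg = mean(mpg))
```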

Codebook

The second step is a codebook. Arguably, this is the first step because you should create the codebook long before you open R and load your data.

Why a codebook? Well, because you typically have a lot of variables and you will not be able to remember all the details that go into each one of them (rating scale, what the actual item was, whether it was coded in some way, etc.). This is especially true now that data are being collected online, which often provides placeholder variable names that then need to be processed somehow. This codebook will serve as a means to document the RAW data. It will also allow us to automate some tasks that are somewhat cumbersome, facilitate open data practices, and efficiently see what variables are available. Ultimately, we want to be able to show how we got from the start, with the messy raw data, to our analyses and results at the end. A codebook makes this easier.

To illustrate, we are going to use some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing panel study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.

For this tutorial, I created the codebook for you and included what I believe are the core columns you may need. Some of these columns will not be particularly helpful for this dataset. For example, many of you likely work with datasets that have only a single file, while others work with datasets spread across many files (e.g., different waves, different sources). As a result, the “dataset” column of the codebook may have only a single value for some of you, whereas for others it may have multiple.

Here are my core columns that are based on the original data:

  1. dataset: this column indexes the name of the dataset that you will be pulling the data from. This is important because we will use this info later on (see purrr tutorial) to load and clean specific data files. Even if you don’t have multiple data sets, I believe consistency is more important and suggest using this.

  2. old_name: this column is the name of the variable in the data you are pulling it from. This should be exact. The goal of this column is that it will allow us to select() variables from the original data file and rename them something that is more useful to us. If you have worked with qualtrics (really any data) you know why this is important.

  3. item_text: this column is the original text that participants saw or a description of the item.

  4. scale: this column tells you what the scale of the variable is. Is it a numeric variable, a text variable, etc. This is helpful for knowing the plausible range.

  5. reverse: this column tells you whether items in a scale need to be reverse coded. I recommend coding this as 1 (leave alone) and -1 (reverse) for reasons that will become clear later.

  6. mini: this column represents the minimum value of scales that are numeric. Leave blank otherwise.

  7. maxi: this column represents the maximum value of scales that are numeric. Leave blank otherwise.

  8. recode: sometimes, we want to recode variables for analyses (e.g. for categorical variables with many levels where sample sizes for some levels are too small to actually do anything with it). I use this column to note the kind of recoding I’ll do to a variable for transparency.

Here are additional columns that will make our lives easier or are applicable to some but not all data sets:

  1. category: broad categories that different variables can be put into. I’m a fan of naming them things like “outcome”, “predictor”, “moderator”, “demographic”, “procedural”, etc. but sometimes use more descriptive labels like “Big 5” to indicate the model from which the measures are derived.

  2. label: label is basically one level lower than category. So if the category is Big 5, the label would be, for example, “A” for Agreeableness, “SWB” for subjective well-being, etc. This column is most important and useful when you have multiple items in a scale, so I’ll typically leave this blank when something is a standalone variable (e.g. sex, single-item scales, etc.).

  3. item_name: This is the lowest level and most descriptive variable. It indicates which item in a scale something is. So it may be “kind” for Agreeableness or “sex” for the demographic biological sex variable.

  4. year: for longitudinal data, we have several waves of data and the name of the same item across waves is often different, so it’s important to note to which wave an item belongs. You can do this by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data were collected (e.g. 2005, 2009, etc.) if that is appropriate. See Lecture #1 for discussion of meaningful time metrics. Note that this differs from that discussion: in your codebook you want to describe how you collected the data, not necessarily how you want to analyze the data.

  5. new_name: This is a column that brings together much of the information we’ve already collected. Its purpose is to be the new name that we will give to the variable, one that is more useful and descriptive to us. This is a constructed variable that brings the others together. I like to make it a combination of “category”, “label”, “item_name”, and year, using varying combos of "_" and “.” that we can use later with tidyverse functions. I typically construct this variable in Excel using the CONCATENATE() function, but it could also be done in R. The reason I do it in Excel is that it makes things easier for someone who may be reviewing my codebook.

There is a separate discussion to be had on naming conventions for your variables, but the important idea to remember is that names convey important information and we want to use this information later on to make our life easier. By coding these variables using this information AND systematically using different separators we can accomplish this goal.

  6. meta: Some datasets have a meta name, which essentially means a name that a variable has across all waves, making it clear which variables are the same. They are not always useful, as some data sets have meta names but no great way of extracting variables using them. But they’re still typically useful to include in your codebook regardless.

These are just suggestions, but after working with many longitudinal datasets I will say all of them are horrible in some way. Doing this makes them less horrible. Is it some upfront work? Yes. Will it ultimately save you time? Yes. Also, if you know this prior to running a study, you are making some sort of codebook anyway, right? Might as well kill two birds with one stone.
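The new_name construction can also be done in R rather than Excel. A sketch with hypothetical codebook rows, where paste0() plays the role of Excel's CONCATENATE():

```r
library(tidyverse)

# Hypothetical codebook rows
cb <- tribble(
  ~category, ~label, ~item_name, ~year,
  "Big 5",   "C",    "thorough", 2005,
  "Big 5",   "C",    "thorough", 2009
)

# Combine the pieces with systematic separators
cb %>%
  mutate(new_name = paste0(category, "__", label, "_", item_name, ".", year))
# new_name: "Big 5__C_thorough.2005", "Big 5__C_thorough.2009"
```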

You can make the codebook any way you want, but the two best options are Microsoft Excel and Google Sheets. Not because they necessarily function best, but because they are relatively ubiquitous and easy to share.

We will create a codebook and then bring it into R by saving it as a .csv. You can think of the codebook as a way of coding your data prior to putting anything into R.

Below, I’ll load in the codebook we will use for this study, which will include all of the above columns.

codebook <- read.csv("https://raw.githubusercontent.com/josh-jackson/longitudinal-2021/master/codebook.csv")

codebook <- codebook %>% 
    mutate(old_name = str_to_lower(old_name))

head(codebook)
  dataset old_name                item_text
1           persnr Never Changing Person ID
2             hhnr             household ID
3   ppfad  gebjahr            Year of Birth
4   ppfad      sex                      Sex
5      vp  vp12501          Thorough Worker
6      zp  zp12001          Thorough Worker
                                                                                                                                                                                                                                                                                    scale
1                                                                                                                                                                                                                                                                                        
2                                                                                                                                                                                                                                                                                        
3                                                                                                                                                                                                                                                                                 numeric
4 \n1 [1] Male\n2 [2] Female\n-1 [-1] No Answer\n-2 [-2] Does not apply\n-3 [-3] Answer improbable\n-4 [-4] Inadmissible multiple response\n-5 [-5] Not included in this version of the questionnaire\n-6 [-6] Version of questionnaire with modified filtering\n56186 57630 24 0 0 0 0 0
5                                                                                                                                                                                                                                                                                        
6                                                                                                                                                                                                                                                                                        
     category label item_name year               new_name reverse
1  Procedural             SID    0        Procedural__SID       1
2  Procedural       household    0  Procedural__household       1
3 Demographic             DOB    0       Demographic__DOB       1
4 Demographic             Sex    0       Demographic__Sex       1
5       Big 5     C  thorough 2005 Big 5__C_thorough.2005       1
6       Big 5     C  thorough 2009 Big 5__C_thorough.2009       1
  mini maxi recode
1   NA   NA       
2   NA   NA       
3   NA   NA       
4   NA   NA       
5    1    7       
6    1    7       

Data

First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.

Note: we will be using the teaching set of the GSOEP data set, so I will not be pulling from the raw files. I will also not be mirroring the format in which you would usually load the GSOEP, because that is slightly more complicated and something we will return to in a later tutorial after we have more skills. I’ve left that code in for now, but it won’t make a lot of sense yet.

This code below shows how I would read in and rename a wide-format data set using the codebook I created.

old.names <- codebook$old_name # get old column names
new.names <- codebook$new_name # get new column names

soep <- read.csv("https://raw.githubusercontent.com/josh-jackson/longitudinal-2021/master/soepdata.csv")

soep <- soep %>% 
  dplyr::select(all_of(old.names)) %>% # select the columns named in our codebook
  setNames(new.names) # rename columns with our new names
paged_table(soep)

Clean Data

Recode Variables

Many of the data we work with have observations that are missing for a variety of reasons. In R, we treat missing values as NA, but many other programs from which you may be importing your data may use other codes (e.g. 999, -999, etc.). Large panel studies tend to use small negative values to indicate different types of missingness. This is why it is important to note down the scale in your codebook. That way you can check which values may need to be recoded to explicit NA values.

In the GSOEP, -1 to -7 indicate various types of missing values, so we will recode these to NA. To do this, we will use mapvalues(), from the plyr package. In later tutorials where we read in and manipulate more complex data sets, we will use mapvalues() a lot.

Below we are taking the dataset soep and saying we are going to mutate all of our variables, making sure they are all numeric (as the code expects numeric), then we get to the mapvalues function:

mapvalues() takes 3 key arguments:

  1. the variable you are recoding. Below, that is indicated by “.”, which is shorthand for the data that was piped in.

  2. from: a vector of initial values that you want to change. Here we indicate a sequence of values from -1 to -7, which correspond to the missing value codes used by the GSOEP. Other datasets may use -999, for example.

  3. to: the new values, in the same order as the old values. Here we have NA (the way R treats missing data) repeated 7 times (to correspond to -1, -2, …).

It is also helpful to turn off warnings if some levels are not in your data (warn_missing = F).

soep <- soep %>%
  mutate_all(~as.numeric(mapvalues(., from = seq(-1,-7, -1), 
                to = rep(NA, 7), warn_missing = F)))
paged_table(soep)

mapvalues() is technically deprecated and no longer actively developed. This is fine, but its newer replacement is dplyr’s recode(). However, recode() does not work well with vectorized names, so it isn’t as useful for our purposes.
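For completeness, here is a sketch of a tidyverse-native alternative (not the approach used above): a conditional replace() inside across() accomplishes the same recode without plyr:

```r
library(dplyr)

# A sketch: recode the GSOEP missing codes (-1 to -7) to NA in every column
soep_na <- soep %>%
  mutate(across(everything(), ~ replace(., . %in% -1:-7, NA)))
```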

Reverse-Scoring

Many scales we use have items that are positively or negatively keyed. High ratings on positively keyed items are indicative of being high on a construct. In contrast, high ratings on negatively keyed items are indicative of being low on a construct. Thus, to create the composite scores of constructs we often use, we must first “reverse” the negatively keyed items so that high scores indicate being higher on the construct.

There are a few ways to do this in R. Below, I’ll demonstrate how to do so using the reverse.code() function from the psych package. This function was built to make reverse coding more efficient (i.e., please don’t reverse every item that needs recoding with a separate line of code!!).
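Before turning to the real data, a minimal sketch of reverse.code() on toy data, where hypothetical item q2 is negatively keyed on a 1–7 scale. Note that the keys vector uses the same 1/-1 coding recommended for the codebook’s reverse column:

```r
library(psych)

# Toy data (hypothetical items): q2 is negatively keyed on a 1-7 scale
items <- data.frame(q1 = c(5, 2, 4),
                    q2 = c(1, 4, 2))
keys  <- c(1, -1)  # 1 = leave alone, -1 = reverse, as in the codebook

# Reversed values are max + min - old, so q2 becomes 7, 4, 6
reverse.code(keys, items, mins = c(1, 1), maxs = c(7, 7))
```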

Before we can do that, though, we need to restructure the data a bit in order to bring in the reverse coding information from our codebook.

head(soep) 
  Procedural__SID Procedural__household Demographic__DOB
1             901                    94             1951
2            1202                   124             1913
3            2301                   230             1946
4            2302                   230             1946
5            2304                   230             1978
6            2305                   230             1946
  Demographic__Sex Big 5__C_thorough.2005 Big 5__C_thorough.2009
1                2                      6                     NA
2                2                      7                     NA
3                1                      7                      6
4                2                      6                      7
5                1                      5                     NA
6                2                     NA                     NA
  Big 5__C_thorough.2013 Big 5__E_communic.2005
1                      7                      4
2                     NA                      6
3                      6                      6
4                      7                      5
5                     NA                      7
6                     NA                     NA
  Big 5__E_communic.2009 Big 5__E_communic.2013 Big 5__A_coarse.2005
1                      4                      5                    4
2                     NA                     NA                    1
3                      4                      5                    5
4                      5                      5                    6
5                     NA                     NA                    4
6                     NA                     NA                   NA
  Big 5__A_coarse.2009 Big 5__A_coarse.2013 Big 5__O_original.2005
1                    3                    4                      5
2                   NA                   NA                      6
3                    4                    2                      5
4                    5                    5                      5
5                   NA                   NA                      7
6                   NA                   NA                     NA
  Big 5__O_original.2009 Big 5__O_original.2013 Big 5__N_worry.2005
1                      4                      3                   6
2                     NA                     NA                   6
3                      5                      5                   4
4                      5                      6                   3
5                     NA                     NA                   4
6                     NA                     NA                  NA
  Big 5__N_worry.2009 Big 5__N_worry.2013 Big 5__A_forgive.2005
1                   6                   6                     5
2                  NA                  NA                    NA
3                   6                   3                     6
4                   3                   3                     3
5                  NA                  NA                     7
6                  NA                  NA                    NA
  Big 5__A_forgive.2009 Big 5__A_forgive.2013 Big 5__C_lazy.2005
1                     5                     4                  4
2                    NA                    NA                  1
3                     7                     6                  4
4                     4                     3                  2
  [wide-format print output truncated for readability: the remaining columns are the Big 5 items (C_lazy, E_sociable, O_artistic, N_nervous, C_efficient, E_reserved, A_friendly, O_imagin, N_dealStress), each measured in 2005, 2009, and 2013, plus yearly Life Event indicators for 2005-2015 (ChldBrth, ChldMvOut, Divorce, DadDied, NewPart, Married, MomDied, MoveIn, PartDied, SepPart). For these six preview rows the life event columns are almost entirely NA, with scattered 1s marking years an event occurred.]

Bring the wide dataset to long format

soep_long <- soep %>%
  pivot_longer(cols = c(-contains("Procedural"), -contains("Demographic")), # keep ID and demographic columns wide
               names_to = "item", 
               values_to = "value", 
               values_drop_na = TRUE) # drop person-item rows with missing values
paged_table(soep_long)
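If the tidyselect call above feels opaque, here is a minimal sketch of the same pattern on a made-up two-person data frame (the toy columns are hypothetical, not part of the SOEP data): every column except the protected ID gets stacked into item/value pairs.

```r
library(tidyr)
library(dplyr)

# hypothetical mini example: one protected ID column, two measure columns
toy <- tibble(Procedural__SID = 1:2,
              `Big 5__C_lazy.2005`     = c(4, NA),
              `Big 5__E_sociable.2005` = c(5, 7))

toy %>%
  pivot_longer(cols = -contains("Procedural"), # everything except the ID
               names_to = "item",
               values_to = "value",
               values_drop_na = TRUE)          # the NA row disappears
# result: one row per person-item combination, with the NA dropped
```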

Bring in the codebook's relevant items for reverse coding

soep_long <- soep %>%
  pivot_longer(cols = c(-contains("Procedural"), -contains("Demographic")),
               names_to = "item", 
               values_to = "value", 
               values_drop_na = TRUE) %>%
  left_join(codebook %>% select(item = new_name, reverse, mini, maxi)) 
soep_long
# A tibble: 471,722 x 9
   Procedural__SID Procedural__hou… Demographic__DOB Demographic__Sex
             <dbl>            <dbl>            <dbl>            <dbl>
 1             901               94             1951                2
 2             901               94             1951                2
 3             901               94             1951                2
 4             901               94             1951                2
 5             901               94             1951                2
 6             901               94             1951                2
 7             901               94             1951                2
 8             901               94             1951                2
 9             901               94             1951                2
10             901               94             1951                2
# … with 471,712 more rows, and 5 more variables: item <chr>,
#   value <dbl>, reverse <int>, mini <int>, maxi <int>

Here we want to break our item column up into different components to assist with later calculations. Often you will have some sort of hierarchy of variables, where items are nested within scales, which are in turn nested within questionnaires. This is where you can encode that information.

soep_long <- soep %>%
  pivot_longer(cols = c(-contains("Procedural"), -contains("Demographic")),
               names_to = "item", 
               values_to = "value", 
               values_drop_na = TRUE) %>%
  left_join(codebook %>% select(item = new_name, reverse, mini, maxi)) %>% 
  separate(item, c("type", "item"), sep = "__") %>% 
  separate(item, c("item", "year"), sep = "[.]") %>% 
  separate(item, c("trait", "item"), sep = "_") 
paged_table(soep_long)

Now it is ready to reverse code!

soep_long <- soep %>%
  pivot_longer(cols = c(-contains("Procedural"), -contains("Demographic")),
               names_to = "item", 
               values_to = "value", 
               values_drop_na = TRUE) %>%
  left_join(codebook %>% select(item = new_name, reverse, mini, maxi)) %>% 
  separate(item, c("type", "item"), sep = "__") %>% 
  separate(item, c("item", "year"), sep = "[.]") %>% 
  separate(item, c("trait", "item"), sep = "_") %>% 
  mutate(value = as.numeric(value), # change to numeric
         value = ifelse(reverse == -1, 
            reverse.code(-1, value, mini = mini, maxi = maxi), value))
paged_table(soep_long)
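Under the hood, reverse coding is just arithmetic: the reversed value equals the scale minimum plus the scale maximum minus the original value. A base R sketch of that identity (the psych package's reverse.code() does this plus some bookkeeping):

```r
# reverse code by hand: new = mini + maxi - old
value <- c(1, 4, 7) # responses on a 1-7 scale
mini  <- 1          # scale minimum
maxi  <- 7          # scale maximum
mini + maxi - value # 7, 4, 1: endpoints swap, the midpoint stays put
```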

Create Composites

Now that we have reverse coded our items, we can create composites.

BFI-S

We’ll start with our scale – in this case, the Big 5 from the German translation of the BFI-S.

The “simplest” way is to use a function like rowMeans. It is also the longest way, because you’d have to repeat it for each scale in each year, so I don’t recommend it; it results in MANY MANY lines of code.

soep$C.2005 <- with(soep, rowMeans(cbind(`Big 5__C_thorough.2005`, 
                                         `Big 5__C_lazy.2005`, 
                                         `Big 5__C_efficient.2005`), 
                                   na.rm = T)) 

soep <- soep %>% select(-C.2005) 

We can use our codebook and dplyr to make our lives a whole lot easier. In general, running everything in one pipeline saves you from copy-paste errors, makes your code more readable, and reduces the total amount of code. So while the code below may not make intuitive sense immediately, it is nonetheless what we are working toward. Going through it line by line will help you see that.

First, make sure we are only working with Big Five rows. Notice how the number of rows decreases once we filter.

b5_soep_long <- soep_long %>%
  filter(type == "Big 5")
paged_table(b5_soep_long)
b5_soep_long <- soep_long %>%
  filter(type == "Big 5") %>% 
  group_by(Procedural__SID, trait, year) %>% 
  summarize(value = mean(value, na.rm = T)) 
paged_table(b5_soep_long)

Now that we have our means we can bring the demographic info back into the dataframe…or whatever else you would want to bring in.

b5_soep_long <- soep_long %>%
  filter(type == "Big 5") %>% 
  group_by(Procedural__SID, trait, year) %>% 
  summarize(value = mean(value, na.rm = T)) %>% 
  ungroup() %>% 
  left_join(soep_long %>% 
    select(Procedural__SID, DOB = Demographic__DOB, Sex = Demographic__Sex) %>%
    distinct())
paged_table(b5_soep_long) 

Descriptives

Descriptives of your data are incredibly important. They are your first line of defense against things that could go wrong later on when you run inferential stats. They help you check the distribution of your variables (e.g., non-normal distributions), look for implausible values arising from coding errors or participant error, and allow you to anticipate what your findings will look like.

There are lots of ways to create great tables of descriptives. My favorite way is using dplyr, but we will save that for a later lesson on creating great APA style tables in R. For now, we’ll use a wonderfully helpful function from the psych package called describe() in conjunction with a small amount of tidyr to reshape the data.

metric variables

paged_table(b5_soep_long)
b5_soep_long_des <- b5_soep_long %>%
  unite(tmp, trait, year, sep = "_") 
paged_table(b5_soep_long_des)
b5_soep_long_des <- b5_soep_long %>%
  unite(tmp, trait, year, sep = "_") %>%  
  pivot_wider(names_from = tmp, values_from = value) 
paged_table(b5_soep_long_des)
b5_soep_long_des <- b5_soep_long %>%
  unite(tmp, trait, year, sep = "_") %>% 
  pivot_wider(names_from = tmp, values_from = value) %>% 
  describe(.) 
paged_table(b5_soep_long_des)

count variables

We have life event variables in the dataset that are count variables, asking whether someone experienced a given life event during the previous year. We also want to create a variable that indexes whether our participants experienced any of the life events during the years of interest (2005-2015).

events_long <- soep_long %>%
  filter(type == "Life Event") 
paged_table(events_long)
events_long <- soep_long %>%
  filter(type == "Life Event") %>% 
  group_by(Procedural__SID, trait) %>% 
  summarize(value = sum(value, na.rm = T),
            value = ifelse(value >= 1, 1, 0)) # 1 = experienced the event at least once
paged_table(events_long)

For count variables, like life events, we need to use something slightly different. We’re typically more interested in counts – in this case, how many people experienced each life event during the years we’re considering (2005-2015).

To do this, we’ll use a little bit of dplyr rather than the base R function table() that is often used for count data. Instead, we’ll use a combination of group_by() and n() to get the counts by group. In the end, we’re left with a nice little table of counts.

events_long %>%
  group_by(trait, value) %>% 
  summarize(N = n()) 
# A tibble: 20 x 3
# Groups:   trait [10]
   trait     value     N
   <chr>     <dbl> <int>
 1 ChldBrth      0  1600
 2 ChldBrth      1   735
 3 ChldMvOut     0  1555
 4 ChldMvOut     1   830
 5 DadDied       0   953
 6 DadDied       1   213
 7 Divorce       0   414
 8 Divorce       1   122
 9 Married       0  1646
10 Married       1   331
11 MomDied       0   929
12 MomDied       1   219
13 MoveIn        0  1403
14 MoveIn        1   419
15 NewPart       0  1207
16 NewPart       1   420
17 PartDied      0   402
18 PartDied      1    76
19 SepPart       0  1172
20 SepPart       1   415
events_long %>%
  group_by(trait, value) %>% 
  summarize(N = n()) %>%
  ungroup() %>%
  pivot_wider(names_from = value, values_from = N)
# A tibble: 10 x 3
   trait       `0`   `1`
   <chr>     <int> <int>
 1 ChldBrth   1600   735
 2 ChldMvOut  1555   830
 3 DadDied     953   213
 4 Divorce     414   122
 5 Married    1646   331
 6 MomDied     929   219
 7 MoveIn     1403   419
 8 NewPart    1207   420
 9 PartDied    402    76
10 SepPart    1172   415
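The same group_by()/summarize() pattern extends naturally. For instance, a hypothetical follow-up (not in the original analysis) that turns the raw counts into the proportion of people experiencing each event:

```r
library(dplyr)
library(tidyr)

events_long %>%
  group_by(trait, value) %>% 
  summarize(N = n()) %>%                  # still grouped by trait after summarize
  mutate(prop = round(N / sum(N), 2)) %>% # within-trait proportion
  ungroup() %>%
  pivot_wider(names_from = value, values_from = c(N, prop))
```

Because summarize() drops only the last grouping variable, the mutate() computes proportions within each trait before the table is widened.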

Zero-Order Correlations

Finally, we often want to look at the zero-order correlation among study variables to make sure they are performing as we think they should.

To run the correlations, we will need to have our data in wide format, so we’re going to do a little bit of reshaping before we run them.

b5_soep_long %>%
  unite(tmp, trait, year, sep = "_") %>%
  pivot_wider(names_from = tmp, values_from = value) %>% 
  select(-Procedural__SID) %>%
  cor(., use = "pairwise") %>%
  round(., 2)
         DOB   Sex A_2005 A_2009 A_2013 C_2005 C_2009 C_2013 E_2005
DOB     1.00  0.00  -0.08  -0.07  -0.06  -0.13  -0.12  -0.14   0.10
Sex     0.00  1.00   0.18   0.17   0.18   0.05   0.07   0.09   0.08
A_2005 -0.08  0.18   1.00   0.50   0.50   0.32   0.20   0.19   0.10
A_2009 -0.07  0.17   0.50   1.00   0.55   0.19   0.28   0.18   0.05
A_2013 -0.06  0.18   0.50   0.55   1.00   0.18   0.19   0.29   0.04
C_2005 -0.13  0.05   0.32   0.19   0.18   1.00   0.52   0.48   0.19
C_2009 -0.12  0.07   0.20   0.28   0.19   0.52   1.00   0.55   0.12
C_2013 -0.14  0.09   0.19   0.18   0.29   0.48   0.55   1.00   0.13
E_2005  0.10  0.08   0.10   0.05   0.04   0.19   0.12   0.13   1.00
E_2009  0.12  0.08   0.06   0.08   0.06   0.10   0.16   0.14   0.61
E_2013  0.10  0.11   0.04   0.04   0.07   0.10   0.10   0.18   0.59
N_2005  0.06 -0.18   0.10   0.06   0.02   0.09   0.06   0.03   0.18
N_2009  0.03 -0.22   0.07   0.09   0.03   0.06   0.08   0.05   0.13
N_2013  0.02 -0.21   0.06   0.06   0.10   0.04   0.06   0.08   0.10
O_2005  0.11  0.06   0.12   0.09   0.07   0.17   0.12   0.08   0.40
O_2009  0.10  0.05   0.05   0.11   0.07   0.06   0.14   0.08   0.26
O_2013  0.05  0.07   0.08   0.09   0.13   0.07   0.08   0.15   0.24
       E_2009 E_2013 N_2005 N_2009 N_2013 O_2005 O_2009 O_2013
DOB      0.12   0.10   0.06   0.03   0.02   0.11   0.10   0.05
Sex      0.08   0.11  -0.18  -0.22  -0.21   0.06   0.05   0.07
A_2005   0.06   0.04   0.10   0.07   0.06   0.12   0.05   0.08
A_2009   0.08   0.04   0.06   0.09   0.06   0.09   0.11   0.09
A_2013   0.06   0.07   0.02   0.03   0.10   0.07   0.07   0.13
C_2005   0.10   0.10   0.09   0.06   0.04   0.17   0.06   0.07
C_2009   0.16   0.10   0.06   0.08   0.06   0.12   0.14   0.08
C_2013   0.14   0.18   0.03   0.05   0.08   0.08   0.08   0.15
E_2005   0.61   0.59   0.18   0.13   0.10   0.40   0.26   0.24
E_2009   1.00   0.65   0.10   0.16   0.10   0.29   0.36   0.28
E_2013   0.65   1.00   0.11   0.13   0.15   0.26   0.28   0.35
N_2005   0.10   0.11   1.00   0.55   0.53   0.09   0.08   0.06
N_2009   0.16   0.13   0.55   1.00   0.60   0.06   0.07   0.07
N_2013   0.10   0.15   0.53   0.60   1.00   0.05   0.05   0.05
O_2005   0.29   0.26   0.09   0.06   0.05   1.00   0.58   0.55
O_2009   0.36   0.28   0.08   0.07   0.05   0.58   1.00   0.61
O_2013   0.28   0.35   0.06   0.07   0.05   0.55   0.61   1.00